For the Explore and Summarise Data project, I chose the prosperLoanData data set. It has 81 columns and 113,937 observations.
Loan Original Amount (LoanOriginalAmount) is positively skewed: the mean (8337) exceeds the median (6500).
Monthly Loan Payment is also positively skewed, with a median of 217 and a mean of 272.
Credit Score Range Lower is approximately normally distributed, with a mean of 685.56 and a median of 680.
BorrowerRate is also roughly normally distributed, with a mean of 0.19 and a median of 0.18. However, there are several spikes, especially between 0.24 and 0.31.
Debt To Income Ratio shows a slight positive skew, with a mean of 0.275 and a median of 0.22. Notably, the ratio is below 0.5 for a large share of borrowers.
Stated Monthly Income is positively skewed, with a mean of 5608 and a median of 4666.
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's
##      0.0    637.5   1061.5   1224.9   1622.4 171004.2     8554
MonthlyDebtTotal has been added as a new column. It was derived by multiplying the Debt To Income Ratio by the Stated Monthly Income.
Total monthly debt is positively skewed: the mean (1224.9) is greater than the median (1061.5). The first quartile is 637.5 and the third quartile is 1622.4.
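The derivation described above can be sketched in base R as follows. This is a minimal illustration on a made-up four-row data frame, not the report's original code; the column names `DebtToIncomeRatio` and `StatedMonthlyIncome` are assumed to match the data set. Because a missing input propagates to the product, NA's in the summary would be consistent with missing values in either source column.

```r
# Hypothetical miniature of the loan data; values are made up for illustration.
loans <- data.frame(
  DebtToIncomeRatio   = c(0.20, 0.35, NA, 0.10),
  StatedMonthlyIncome = c(5000, 3000, 4000, 8000)
)

# Monthly debt = debt-to-income ratio x stated monthly income.
# A missing ratio (or income) yields NA in the derived column.
loans$MonthlyDebtTotal <- loans$DebtToIncomeRatio * loans$StatedMonthlyIncome

# summary() reports the quartiles, the mean, and the number of NA's.
summary(loans$MonthlyDebtTotal)
```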
The Employment Status Duration is positively skewed: the mean (96) is higher than the median (67). Most borrowers have relatively short employment durations.
Current Credit Lines has a slight positive skew, with a mean of 10.31 and a median of 10.
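The skew statements above all rest on comparing the mean with the median. A minimal sketch of that check, using a made-up vector rather than the loan data, could look like this:

```r
# Hypothetical right-skewed sample (not taken from the loan data).
x <- c(1000, 2000, 3000, 4000, 5000, 25000)

mean_x   <- mean(x)    # pulled upward by the single large value
median_x <- median(x)  # much less affected by the large value

# TRUE suggests a positive (right) skew.
mean_x > median_x
```

A mean above the median is only a rule of thumb for positive skew, so it is worth confirming against the histogram as well.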
Loans originate mostly during October, December and January.
Originations dropped during 2009, which may be attributable to the economic crisis of that period.
##
##              Cancelled             Chargedoff              Completed
##                      5                  11992                  38074
##                Current              Defaulted FinalPaymentInProgress
##                  56576                   5018                    205
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days)
##                     16                    806                    265
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days)
##                    363                    313                    304
Most of the loans are either Current (56,576) or Completed (38,074).
##             $0      $1-24,999      $100,000+ $25,000-49,999 $50,000-74,999
##            621           7274          17337          32192          31050
## $75,000-99,999  Not displayed   Not employed
##          16916           7741            806
Most of the loans are taken by people in the $25,000-49,999 and $50,000-74,999 income ranges.
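The income ranges in the table above are sorted alphabetically, which places $100,000+ before $25,000-49,999. A minimal sketch of reordering them, using the counts copied from that table, is shown here; the chosen display order (`level_order`) is an assumption for illustration.

```r
# Counts copied from the table above.
income_counts <- c(
  "$0" = 621, "$1-24,999" = 7274, "$100,000+" = 17337,
  "$25,000-49,999" = 32192, "$50,000-74,999" = 31050,
  "$75,000-99,999" = 16916, "Not displayed" = 7741, "Not employed" = 806
)

# Assumed display order: non-numeric categories first, then ascending income.
level_order <- c(
  "Not displayed", "Not employed", "$0", "$1-24,999", "$25,000-49,999",
  "$50,000-74,999", "$75,000-99,999", "$100,000+"
)

# Reorder the named counts by indexing with the desired order.
ordered_counts <- income_counts[level_order]

# Share of all loans that fall in the two most common brackets.
top_two_share <- sum(ordered_counts[c("$25,000-49,999", "$50,000-74,999")]) /
  sum(ordered_counts)
round(top_two_share, 3)
```

For a factor column in the data frame itself, `factor(x, levels = level_order)` would apply the same ordering before tabulating or plotting.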
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   12.00   36.00   36.00   40.83   36.00   60.00
##
##    12    36    60
##  1614 87778 24545
Most of the loans have a term of 36 months (3 years).
The top 10 states by number of borrowers were plotted, and California topped the list.
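A ranking like the one described above can be computed with `table()` and `sort()`. The sketch below uses a few made-up state codes rather than the real column, which is assumed here to be named `BorrowerState`.

```r
# Hypothetical state codes; the real column is assumed to be BorrowerState.
borrower_state <- c("CA", "TX", "CA", "NY", "CA", "TX", "FL", NA)

# table() drops NA by default; sort descending and keep at most ten entries.
state_counts <- sort(table(borrower_state), decreasing = TRUE)
top_states   <- head(state_counts, 10)

# Name of the state with the most borrowers in this toy sample.
names(top_states)[1]
```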
Since most of the loans belong to the “Other” occupation category, the data is not sufficient to determine which occupations take out the most loans.
## [1] "Borrowers - Are they home owners?"
## False  True
## 56459 57478
Just over 50% of the borrowers (57,478 of 113,937) are home owners.
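The home-owner share can be checked directly from the counts printed above. This is a minimal sketch using `prop.table()` on those two numbers:

```r
# Counts copied from the output above.
homeowner_counts <- c(False = 56459, True = 57478)

# Convert counts to proportions of all borrowers.
homeowner_share <- prop.table(homeowner_counts)

# Percentages rounded to one decimal place.
round(100 * homeowner_share, 1)
```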
The Employment Status data does not seem to be complete. Also, as anticipated, the Employed category accounts for the highest number of loans.
## Warning: Removed 16702 rows containing missing values (geom_point).
Plot: Current Credit Lines Vs Total Monthly Debt
There is a moderate positive correlation between Current Credit Lines and the Total Monthly Debt, with a correlation coefficient of 0.47.
## Warning: Removed 766 rows containing missing values (geom_point).
There is a moderate negative correlation between the credit scores and the borrower rate, with a correlation coefficient of -0.46. Borrowers with higher credit scores tend to get loans at a lower borrower rate.
## Warning: Removed 11195 rows containing missing values (geom_point).
There is a positive correlation between the Stated Monthly Income and the Total Monthly Debt, with a correlation coefficient of 0.36.
The correlation between the loan amount and the borrower rate is negative, with a correlation coefficient of -0.32. This suggests that larger loans tend to be disbursed at lower interest rates.
## Warning: Removed 591 rows containing missing values (geom_point).
There is a slight positive correlation between the loan amount and the credit scores: larger loans tend to go to borrowers with higher credit scores.
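The plots in this section drop rows with missing values, so a correlation coefficient has to handle the same gaps. The sketch below shows one way to do that with base R's `cor()`; the data is made up, the column names are assumed to match the data set, and the Pearson default may differ from the method used for the figures quoted above.

```r
# Hypothetical values; column names are assumed to match the data set.
loans <- data.frame(
  CreditScoreRangeLower = c(640, 680, 700, NA, 760, 800),
  BorrowerRate          = c(0.30, 0.24, 0.20, 0.18, NA, 0.08)
)

# use = "complete.obs" drops rows where either variable is NA,
# analogous to the rows ggplot2 removes before plotting.
r <- cor(loans$CreditScoreRangeLower, loans$BorrowerRate,
         use = "complete.obs")

# Number of rows that actually entered the calculation.
n_used <- sum(complete.cases(loans))
```

Reporting `n_used` alongside the coefficient makes clear how many observations the figure is based on.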
## Warning: Removed 11189 rows containing missing values (geom_point).
Stated Monthly Income vs Total Monthly Debt was plotted and faceted by IsBorrowerHomeowner. The dispersion is higher for home owners; for non home owners, the points are concentrated among borrowers with an income of around 5000.
The median loan amount dips in 2009 and rises sharply in 2013.
December to February is the period when loan amounts are usually higher.
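Yearly and monthly medians like those described above can be computed by extracting the year and month from the origination date and grouping on them. The sketch below uses made-up records; `LoanOriginationDate` and `LoanOriginalAmount` are assumed column names.

```r
# Hypothetical records; the column names are assumptions.
loans <- data.frame(
  LoanOriginationDate = as.Date(c("2008-01-15", "2008-06-01", "2009-03-10",
                                  "2013-12-05", "2013-12-20", "2013-02-11")),
  LoanOriginalAmount  = c(5000, 7000, 3000, 10000, 15000, 12000)
)

# Extract the year and the two-digit month as character strings.
loans$Year  <- format(loans$LoanOriginationDate, "%Y")
loans$Month <- format(loans$LoanOriginationDate, "%m")

# Median loan amount per year and per calendar month.
median_by_year  <- tapply(loans$LoanOriginalAmount, loans$Year, median)
median_by_month <- tapply(loans$LoanOriginalAmount, loans$Month, median)
```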